Unsupervised Learning of Depth and Ego-Motion from Video
We present an unsupervised learning framework for the task of monocular depth
and camera motion estimation from unstructured video sequences. We achieve this
by simultaneously training depth and camera pose estimation networks using the
task of view synthesis as the supervisory signal. The networks are thus coupled
via the view synthesis objective during training, but can be applied
independently at test time. Empirical evaluation on the KITTI dataset
demonstrates the effectiveness of our approach: 1) monocular depth performing
comparably with supervised methods that use either ground-truth pose or depth
for training, and 2) pose estimation performing favorably with established SLAM
systems under comparable input settings. Comment: Accepted to CVPR 2017. Project webpage:
https://people.eecs.berkeley.edu/~tinghuiz/projects/SfMLearner
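To illustrate how view synthesis can serve as a supervisory signal, here is a minimal pure-Python sketch: each target pixel is back-projected with its predicted depth, moved by a relative camera pose, and reprojected into a source frame, and the intensity mismatch is averaged. The function names `reproject_pixel` and `photometric_l1`, the pinhole parameters `fx, fy, cx, cy`, and the nearest-neighbour lookup are assumptions made for illustration, not the authors' implementation. A trainable system would need a differentiable sampler instead of rounding.

```python
def reproject_pixel(u, v, depth, fx, fy, cx, cy, R, t):
    """Map target pixel (u, v) with a predicted depth to source-image coordinates.

    R is a 3x3 rotation (nested lists) and t a 3-vector; together they are
    assumed to describe the target-to-source camera motion.
    """
    # Back-project the pixel to a 3D point in the target camera frame.
    X = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]
    # Apply the rigid transform to express the point in the source frame.
    Y = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    # Project into the source image with the same pinhole intrinsics.
    return fx * Y[0] / Y[2] + cx, fy * Y[1] / Y[2] + cy


def photometric_l1(target, source, depth, intrinsics, R, t):
    """Mean absolute intensity difference between each target pixel and the
    source pixel it reprojects to (pixels landing outside the image are skipped)."""
    fx, fy, cx, cy = intrinsics
    h, w = len(target), len(target[0])
    total, count = 0.0, 0
    for v in range(h):
        for u in range(w):
            us, vs = reproject_pixel(u, v, depth[v][u], fx, fy, cx, cy, R, t)
            ui, vi = round(us), round(vs)  # nearest-neighbour lookup (simplification)
            if 0 <= ui < w and 0 <= vi < h:
                total += abs(target[v][u] - source[vi][ui])
                count += 1
    return total / count if count else 0.0
```

In this sketch the loss is low only when depth and pose together explain how the scene moved between frames, which is why the two predictions can be trained jointly from video alone and then used separately.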
Image-to-Image Translation with Conditional Adversarial Networks
We investigate conditional adversarial networks as a general-purpose solution
to image-to-image translation problems. These networks not only learn the
mapping from input image to output image, but also learn a loss function to
train this mapping. This makes it possible to apply the same generic approach
to problems that traditionally would require very different loss formulations.
We demonstrate that this approach is effective at synthesizing photos from
label maps, reconstructing objects from edge maps, and colorizing images, among
other tasks. Indeed, since the release of the pix2pix software associated with
this paper, a large number of internet users (many of them artists) have posted
their own experiments with our system, further demonstrating its wide
applicability and ease of adoption without the need for parameter tweaking. As
a community, we no longer hand-engineer our mapping functions, and this work
suggests we can achieve reasonable results without hand-engineering our loss
functions either. Comment: Website: https://phillipi.github.io/pix2pix/, CVPR 201
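As a rough illustration of what "learning a loss function" can mean here, the sketch below scores a generator by how well its output fools a discriminator that sees the input image alongside it, so the discriminator plays the role of the learned loss. The abstract only states that the loss is learned; the added L1 reconstruction term, the weight `lam`, the non-saturating generator objective, and the function names are all assumptions chosen for this example. Discriminator outputs are taken to be probabilities in (0, 1).

```python
import math


def bce(p, label, eps=1e-12):
    """Binary cross-entropy for one predicted probability p and a 0/1 label."""
    return -(label * math.log(p + eps) + (1.0 - label) * math.log(1.0 - p + eps))


def discriminator_loss(d_real, d_fake):
    """Discriminator objective: score real (input, output) pairs as 1 and
    generated pairs as 0. d_real and d_fake are lists of probabilities."""
    real_term = sum(bce(p, 1.0) for p in d_real) / len(d_real)
    fake_term = sum(bce(p, 0.0) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake, generated, target, lam=100.0):
    """Generator objective: adversarial term (try to be scored as real) plus an
    assumed L1 term weighted by lam that keeps the output near the target."""
    adversarial = sum(bce(p, 1.0) for p in d_fake) / len(d_fake)
    l1 = sum(abs(g - y) for g, y in zip(generated, target)) / len(target)
    return adversarial + lam * l1
```

Because the adversarial term is computed from a network trained alongside the generator, the same two functions could in principle be reused for different translation tasks without writing a task-specific loss by hand.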
Everybody Dance Now
This paper presents a simple method for "do as I do" motion transfer: given a
source video of a person dancing, we can transfer that performance to a novel
(amateur) target after only a few minutes of the target subject performing
standard moves. We approach this problem as video-to-video translation using
pose as an intermediate representation. To transfer the motion, we extract
poses from the source subject and apply the learned pose-to-appearance mapping
to generate the target subject. We predict two consecutive frames for
temporally coherent video results and introduce a separate pipeline for
realistic face synthesis. Although our method is quite simple, it produces
surprisingly compelling results (see video). This motivates us to also provide
a forensics tool for reliable synthetic content detection, which is able to
distinguish videos synthesized by our system from real data. In addition, we
release a first-of-its-kind open-source dataset of videos that can be legally
used for training and motion transfer. Comment: In ICCV 201
Multi-view Relighting using a Geometry-Aware Network
We propose the first learning-based algorithm that can relight images in a plausible and controllable manner given multiple views of an outdoor scene. In particular, we introduce a geometry-aware neural network that utilizes multiple geometry cues (normal maps, specular direction, etc.) and source and target shadow masks computed from a noisy proxy geometry obtained by multi-view stereo. Our model is a three-stage pipeline: two subnetworks refine the source and target shadow masks, and a third performs the final relighting. Furthermore, we introduce a novel representation for the shadow masks, which we call RGB shadow images. They reproject the colors from all views into the shadowed pixels and enable our network to cope with inaccuracies in the proxy and the non-locality of the shadow casting interactions. Acquiring large-scale multi-view relighting datasets for real scenes is challenging, so we train our network on photorealistic synthetic data. At training time, we also compute a noisy stereo-based geometric proxy, this time from the synthetic renderings. This allows us to bridge the gap between the real and synthetic domains. Our model generalizes well to real scenes. It can alter the illumination of drone footage, image-based renderings, textured mesh reconstructions, and even internet photo collections.